Finding trials registered on ClinicalTrials.gov that do not have reported results

Reporting of clinical trial results became mandatory for many trials in 2008. However this paper and this investigation both find that substantial numbers of clinical trials have not reported results, even for those trials where the FDAAA has made reporting mandatory.

This notebook examines how many trials on ClinicalTrials.gov have had their results publicly reported. We have a broader definition of a trial that should report its results than the FDAAA. We count a trial as eligible for our analysis if:

  • it has overall status of 'Completed'
  • it has a study type of 'Interventional'
  • its completion date was after 1 Jan 2006, but is more than 24 months ago
  • it is phase 2 or later (or its phase is N/A, ie it's a trial of a device or a behavioural intervention)
  • it has no results disposition date (i.e. no application to delay results has been filed).

We then classify it as overdue if it has no summary results attached on ClinicalTrials.gov, and no results on PubMed that are linked by NCT ID (see below).

This is substantially broader than FDAAA, which covers only US-based trials of FDA-approved drugs. However, we think all trials should report their results, not just US-based trials, or FDA-approved drugs. In addition, FDAAA requires results to be reported within 12 months of completion, and we allow 24 months.

ClinicalTrials.gov supplies notes on how to find studies with results and results in general.


In [1]:
import csv
from datetime import datetime
from dateutil.relativedelta import relativedelta
import glob
from pprint import pprint
from slugify import slugify

import sqlite3
import numpy as np
import pandas as pd

import utils

Create summary results file

The raw XML trial summaries from ClinicalTrials.gov are supplied as a single very large zip file, containing more than 200,000 XML files. This section assumes that that these have already been downloaded and unzipped in the search_result directory.

Extract the fields of interest from the XML summaries, and save them to a CSV file, which we'll use as our source data for the rest of this exercise. ClinicalTrials.gov supplies field definitions.

Toggle REGENERATE_SUMMARY to False for the purposes of development.


In [2]:
fname = './data/trials.csv'
REGENERATE_SUMMARY = False # False
if REGENERATE_SUMMARY:
    files = glob.glob('./search_result/*.xml')
    print len(files), 'files found'
    fieldnames = ['nct_id', 'title', 'overall_status', 
                  'study_type', 'completion_date',
                  'lead_sponsor', 'lead_sponsor_class',
                  'collaborator', 'collaborator_class', 
                  'phase', 'locations', 'has_drug_intervention', 'drugs', 
                  'disposition_date', 'results_date', 
                  'enrollment']
    trials = csv.DictWriter(open(fname, 'wb'), fieldnames=fieldnames)
    trials.writeheader()
    for i, f in enumerate(files):
        if i % 50000 == 0:
            print i, f
        text = open(f, 'r').read()
        data = utils.extract_ctgov_xml(text)
        trials.writerow(data)
print 'done'


done

Load data for analysis

Load into Pandas, normalising the date and phase fields. NB: If this produces a weird EOF error (which it does intermittently), delete the last line of the file manually. We will have to live with one missing trial.


In [17]:
dtype = {'has_drug_intervention': bool, 
         'phase': str} 
converters = {'enrollment': lambda x: x and int(x) or 0} 
datefields = ['completion_date', 'results_date', 'disposition_date']
df = pd.read_csv(fname,
                 parse_dates=datefields, 
                 infer_datetime_format=True,
                 keep_default_na=False,
                 na_values=None,
                 converters=converters,
                 dtype=dtype)

In [18]:
df['phase_normalised'] = df.phase.apply(utils.normalise_phase)

In [19]:
df.tail()


Out[19]:
nct_id title overall_status study_type completion_date lead_sponsor lead_sponsor_class collaborator collaborator_class phase locations has_drug_intervention drugs disposition_date results_date enrollment phase_normalised
240369 NCT00683956 Evaluation of a Closed-Loop Control System for... Completed Interventional 2009-07-01 Stanford University Other 1 United States True Propofol; NaT NaT 35 1
240370 NCT01540734 Pharmacokinetic Study Investigating the Extent... Completed Interventional 2009-12-01 GlaxoSmithKline Industry 1 United States True Marketed paracetamol; Experimental paracetamol... NaT NaT 28 1
240371 NCT02691429 Applicability of the Acai Fruit (Euterpe Olera... Recruiting Interventional 2016-11-01 Federal University of São Paulo Other Rafael R. Caiado Cristiane S. Peris Acácio Alv... Other Other Other Other Other Other Other Othe... N/A Brazil False NaT NaT 24 5
240372 NCT02947750 Neurovascular Transduction During Exercise in ... Recruiting Interventional 2021-11-01 Emory University Other National Heart, Lung, and Blood Institute (NHLBI) NIH 2 United States True 6R-BH4; NaT NaT 150 2
240373 NCT00005693 Computer Assisted Instruction Weight Managemen... Completed Observational NaT National Heart, Lung, and Blood Institute (NHLBI) NIH N/A False NaT NaT 0 5

Calculate whether trials are completed

The criteria for counting a trial as completed are defined above. Print some summary stats about completed trials.


In [20]:
startdate = datetime.strptime('01 January 2006', '%d %B %Y')
cutoff = datetime.now() - relativedelta(years=2)
print 'Cutoff date', cutoff

df['is_completed'] = (df.overall_status == 'Completed') & \
    (df.study_type.str.startswith('Interventional')) & \
    (df.completion_date >= startdate) & \
    (df.completion_date <= cutoff) & \
    (df.phase_normalised >= 2) & \
    (df.disposition_date.isnull())
df['is_overdue'] = (df.is_completed & \
                    df.results_date.isnull())
df_completed = df[df.is_completed] 
df_overdue = df[df.is_completed & df.results_date.isnull()]

print len(df), 'total trials found'
print len(df[~df.disposition_date.isnull()]), 'trials have dispositions filed'
print len(df_completed), 'are completed and due results, by our definition'
print len(df[df.is_completed & ~df.results_date.isnull()]), \
    'trials due results have submitted results on clinicaltrials.gov'
print len(df_overdue), \
    'trials due results have not submitted results on clinicaltrials.gov'


Cutoff date 2015-05-03 11:15:15.882637
240374 total trials found
3621 trials have dispositions filed
54301 are completed and due results, by our definition
16101 trials due results have submitted results on clinicaltrials.gov
38200 trials due results have not submitted results on clinicaltrials.gov

Check for results on PubMed

If trials have reported their results on PubMed, and if it's possible to find them on PubMed using a linked NCT ID, then we count those trials as having submitted results.

So, for all trials that we regard as completed and due results, and that haven't already reported results on clinicaltrials.gov, we search PubMed, looking for the NCT ID either as a Secondary Source ID, or in the title/abstract. We look for anything published between the completion date and now, that doesn't have the words "study protocol" in the title, and that is classified as results of a trial (using the "therapy" clinical keyword, broad version).

At the time of writing, about 9,000 of the 34,000 trials have results on PubMed. An example of an NCT ID with results on PubMed: NCT02460380. (TODO: Update this).

Note 1: we know from the BMJ paper that there are trials that do have results on PubMed, but that aren't linked using the NCT ID. The BMJ authors found these using a manual search. Some examples: NCT00002762: 19487378, NCT00002879: 18470909, NCT00003134: 19066728, NCT00003596: 18430910. We regard these as invalid, because you can only find results via an exhaustive manual search. We only count results as published for our purposes if they are either (i) submitted on ClinicalTrials.gov or (ii) retrievable on PubMed using the NCT ID. See more on this below.

Note 2: we know there are some trials that have results PMIDs directly in ClinicalTrials.gov, in the results_reference field of the XML. After discussion with Jess here, and Annice at ClinicalTrials.gov, I decided that these results are too often meaningless to be useful - lots of the time they aren't truly results, but are studies from years ago.

Note 3: we also experimented with retrieving the results using the narrow version of the "therapy" clinical keyword, and using no clinical keyword at all. We evaluated these by using multiple PubMed matches as surrogate measures for false identification. At the time of writing on 2016/10/24, we examined 34677 trial registry IDs: the PubMed broad keyword yielded 7815 matches with 1706 multiple matches; the PubMed narrow keyword yielded 6448 matches with 1238 multiple matches, and using no keyword yielded 7981 matches with 1860 multiple matches. We chose the broad keyword for our final results.


In [21]:
# Store results locally.
conn = sqlite3.connect('./data/trials-abstract.db')
cur = conn.cursor()  
c = "CREATE TABLE IF NOT EXISTS trials(nct_id TEXT PRIMARY KEY, "
c += "pubmed_results INT, pubmed_results_broad INT, pubmed_results_narrow INT)"
cur.execute(c)
conn.commit()

REGENERATE_PUBMED_LINKS = False
count = 0
df['pubmed_results'] = False
for i, row in df_overdue.iterrows():
    if count % 10000 == 0:
        print count, row.nct_id
    count += 1
    # First, check for results stored in the local db.
    c = "SELECT nct_id, pubmed_results, pubmed_results_broad, "
    c += "pubmed_results_narrow FROM trials WHERE nct_id='%s'" % row.nct_id
    cur.execute(c)
    data = cur.fetchone()
    has_results = False
    if data and (not REGENERATE_PUBMED_LINKS):
        has_results = bool(int(data[2]))
    else:
        # No local results, or we want to regenerate them: check PubMed.
        broad_results = \
            utils.get_pubmed_linked_articles(row.nct_id, 
                                             row.completion_date, 'broad')
        # Used in the past (see note 3 above).
        simple_results = \
            utils.get_pubmed_linked_articles(row.nct_id, 
                                             row.completion_date, '')
        narrow_results = \
            utils.get_pubmed_linked_articles(row.nct_id, 
                                             row.completion_date, 'narrow')
        c = "INSERT OR REPLACE INTO trials VALUES('%s', %s, %s, %s)" % \
            (row.nct_id, len(simple_results), len(broad_results), len(narrow_results))
        cur.execute(c)
        conn.commit()
        has_results = broad_results > 0
    df.set_value(i, 'pubmed_results', has_results) 
    
cur.close()
conn.close()
print 'done'


0 NCT00135811
10000 NCT00891358
20000 NCT02243930
30000 NCT01196546
done

 Calculate final overdue count

Now we have looked for PubMed results, we can calculate the final overdue count, and print some summary statistics.


In [22]:
# Reset dataframes now we have the results from PubMed.
df['is_overdue'] = (df.is_completed & df.results_date.isnull() & ~df.pubmed_results)
print 'How many of the unreported trials were found on PubMed:'
print df[df.is_completed & df.results_date.isnull()].pubmed_results.value_counts()
df_completed = df[df.is_completed]
df_overdue = df[df.is_overdue]


How many of the unreported trials were found on PubMed:
False    27794
True     10406
Name: pubmed_results, dtype: int64

In [25]:
# Print summary stats for the entire dataset. 
print len(df_completed), 'trials should have published results'
print len(df_overdue), 'trials have not published results'
percent_submitted = (1 - (len(df_overdue) / float(len(df_completed)))) * 100
print '%s%% of completed trials have published results' % \
    '{:,.2f}'.format(percent_submitted)
print int(df_overdue.enrollment.sum()), 'total patients are enrolled in overdue trials'


54301 trials should have published results
27794 trials have not published results
48.81% of completed trials have published results
30570515 total patients are enrolled in overdue trials

In [26]:
# Print summary stats for major trial sponsors only.
NUM_TRIALS = 30
df_major = df_completed[
    df_completed.groupby('lead_sponsor').nct_id.transform(len) >= NUM_TRIALS]
print len(df_major), 'trials by major sponsors should have published results'
print len(df_major[df_major.is_overdue]), 'trials by major sponsors have not published results'
percent_submitted = (1 - (len(df_major[df_major.is_overdue]) / float(len(df_major)))) * 100
print '%s%% of completed trials by major sponsors have published results' % \
    '{:,.2f}'.format(percent_submitted)
print int(df_major[df_major.is_overdue].enrollment.sum()), 'total patients are enrolled in overdue trials'


29377 trials by major sponsors should have published results
13266 trials by major sponsors have not published results
54.84% of completed trials by major sponsors have published results
5602426 total patients are enrolled in overdue trials

In [27]:
df_completed.groupby('lead_sponsor_class').sum()[['is_overdue', 'is_completed']]


Out[27]:
is_overdue is_completed
lead_sponsor_class
Industry 6280.0 16997.0
NIH 421.0 1185.0
Other 20778.0 35267.0
U.S. Fed 315.0 852.0

In [28]:
# Calculate publication rates by sector (raw data)
df_by_sector = df_completed.groupby('lead_sponsor_class').sum()[['is_overdue', 'is_completed']]
df_by_sector['percent_overdue'] = df_by_sector.is_overdue / df_by_sector.is_completed * 100
df_by_sector


Out[28]:
is_overdue is_completed percent_overdue
lead_sponsor_class
Industry 6280.0 16997.0 36.947697
NIH 421.0 1185.0 35.527426
Other 20778.0 35267.0 58.916267
U.S. Fed 315.0 852.0 36.971831

In [30]:
# Calculate publication rates by sector (major sponsors only)
df_major_gp = df_major.groupby('lead_sponsor_class').sum()[['is_overdue', 'is_completed']]
df_major_gp['percent_overdue'] = df_major_gp.is_overdue / df_major_gp.is_completed * 100
df_major_gp


Out[30]:
is_overdue is_completed percent_overdue
lead_sponsor_class
Industry 2474.0 9511.0 26.011986
NIH 345.0 1030.0 33.495146
Other 10330.0 18334.0 56.343406
U.S. Fed 117.0 502.0 23.306773

Write to CSV

Output final results to a CSV file, which we will use in the interactive. We reshape the data so it has a row for each sponsor, and two columns for each year: one column for the number of overdue results, and one for the total trials.

Also, write all the raw data to a single CSV file.


In [31]:
df_completed['year_completed'] = df_completed['completion_date'].dt.year.dropna().astype(int)
df_completed['year_completed'] = df_completed.year_completed.astype(int)

# Drop all sponsors with fewer than N completed trials.
df_final = df_completed[
    df_completed.groupby('lead_sponsor').nct_id.transform(len) >= NUM_TRIALS]

# Now reshape the data: a row for each sponsor, columns by year:
# lead_sponsor,2008_overdue,2008_total,2009_overdue,2009_total...
df_temp = df_final.set_index(['lead_sponsor', 'lead_sponsor_class', 'year_completed']) 
gb = df_temp.groupby(level=[0, 1, 2]).is_overdue
df2 = gb.agg({'overdue': 'sum', 'total': 'count'}) \
          .unstack().swaplevel(0, 1, 1).sort_index(1)
df2.columns = df2.columns.to_series().apply(lambda x: '{}_{}'.format(*x))
df3 = df2.reset_index()
df3['lead_sponsor_slug'] = df3.lead_sponsor.apply(slugify)


/usr/local/lib/python2.7/dist-packages/ipykernel/__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
/usr/local/lib/python2.7/dist-packages/ipykernel/__main__.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app

In [33]:
df3.to_csv('./data/completed.csv', index=None)
print len(df3), 'sponsors found with cutoff point at %s trials' % NUM_TRIALS

# Write the raw output to a full spreadsheet. 
df.to_csv('./data/all.csv', index=None)


323 sponsors found with cutoff point at 30 trials

For reference: Compare our results with BMJ authors

TODO: Make this a separate notebook?

A 2016 BMJ paper found that around 65% of papers reprted results. "Overall, 2892 of the 4347 clinical trials (66.5%) had been published or reported results as of July 2014."

Excellently, the BMJ authors publish their raw data on DataDryad so we can compare our results with theirs, to get an idea of the difference between our automated strategy and their partially manual strategy. (However, in their reported data it looks to me like the matched PMID rate is 59.9% of all NCT IDs.)

The BMJ authors were looking at a much smaller set of papers than us, because they focussed on academic medical centres. Their set is slightly different, because they include pre-Phase-2 trials, and 'Terminated' as well as 'Completed' trials. They also used a manual search strategy which involved searching Scopus and manually comparing results.


In [15]:
from openpyxl import load_workbook
import sys
bmj_results = load_workbook(filename = './data/chen-bmj.xlsx')


/usr/local/lib/python2.7/dist-packages/openpyxl/reader/worksheet.py:322: UserWarning: Data Validation extension is not supported and will be removed
  warn(msg)
/usr/local/lib/python2.7/dist-packages/openpyxl/workbook/names/named_range.py:125: UserWarning: Discarded range with reserved name
  warnings.warn("Discarded range with reserved name")

In [16]:
nct_ids = {}
count = 0
has_pmid = 0

# The Excel data has multiple worksheets, sigh. 
# And NCT IDs can occur more than once with different results, sigh.
# We only care about where there's at least one result.
# Fiddle about and reshape the data so that we know whether 
# each NCT ID has a result. 
for sheet in bmj_results.worksheets:
    for i, row in enumerate(sheet.rows):
        if i == 0:
            continue
        if row[0].value:
            count += 1
            if isinstance(row[6].value, long):
                val = str(row[6].value)
            else:
                val = row[6].value
            if val:
                has_pmid += 1
            # Always set val if it exists.
            # Otherwise, only set val if there's no current value
            # for this NCT ID.
            if val:
                nct_ids[row[0].value] = val
            else:
                if not row[0].value in nct_ids:
                    nct_ids[row[0].value] = val

print count, 'rows found in total'
print has_pmid, 'of those rows have a PMID'
print has_pmid / float(count) * 100, 'per cent of their NCT IDs have a PMID, including duplicates'
print
unique_nct_ids = len(nct_ids.keys())
print unique_nct_ids, '*unique* NCT IDs found in all rows'
pmids_found = sum(1 for x in nct_ids.values() if x)
print pmids_found, 'of these have PMIDs'
print pmids_found / float(unique_nct_ids) * 100, 'per cent of unique NCT IDs have a PMID'


4347 rows found in total
2452 of those rows have a PMID
56.4067172763 per cent of their NCT IDs have a PMID, including duplicates

4092 *unique* NCT IDs found in all rows
2295 of these have PMIDs
56.0850439883 per cent of unique NCT IDs have a PMID

Compare with our data

Now examine:

  • of the NCT IDs for which BMJ authors find PubMed results, how many we also find PubMed results for
  • of the same dataset, how many only BMJ find results for
  • of the NCT IDs for which BMJ authors do not find PubMed results, how many we do find PubMed results

In [17]:
df_bmj = pd.Series(nct_ids).to_frame(name='pmid')
df_bmj['pubmed_results'] = ~df_bmj.pmid.isnull()
df_bmj.index.name = 'nct_id'
df_bmj.reset_index(inplace=True)
print len(df_bmj), 'NCT IDs in the full BMJ dataset'
# df_bmj.head(20)

merged_results = \
    pd.merge(df_bmj, df_completed, #[['nct_id', 'pubmed_results']], 
             on='nct_id', how='inner', suffixes=('_bmj', '_ours'))
    
# NB I tried this first with a left join: but 1521 out of the 4500 papers 
# don't appear in our dataset, because the BMJ authors' inclusion criteria are 
# different from ours. To get a sample after a left join...
# merged_results[merged_results.we_have_results.isnull()].head()

merged_results['we_have_results'] = ~merged_results.is_overdue
merged_results.we_have_results.value_counts(dropna=False)
# merged_results.head()
print len(merged_results), 'NCT IDs are in both the BMJ dataset and ours'

papers_both_find_pm_results = \
    merged_results[merged_results.pubmed_results_bmj & merged_results.we_have_results]
papers_both_find_pm_results.head()
print len(papers_both_find_pm_results), 'we both find results for'
papers_only_they_find_results = \
    merged_results[merged_results.pubmed_results_bmj & ~merged_results.we_have_results]
print len(papers_only_they_find_results), 'only they find results for'  
papers_only_we_find_results = \
    merged_results[~merged_results.pubmed_results_bmj & merged_results.we_have_results]
print len(papers_only_we_find_results), 'only we find results for'
noone_finds_results = \
    merged_results[~merged_results.pubmed_results_bmj & ~merged_results.we_have_results]
print len(noone_finds_results), 'neither of us find results for'


4092 NCT IDs in the full BMJ dataset
2541 NCT IDs are in both the BMJ dataset and ours
1164 we both find results for
461 only they find results for
409 only we find results for
507 neither of us find results for

In [18]:
# Examine a sample of the papers only they find results for.
cols = ['nct_id', 'title', 'pubmed_results_bmj', 'pmid', 'we_have_results']
papers_only_they_find_results.sample(10)[cols]


Out[18]:
nct_id title pubmed_results_bmj pmid we_have_results
758 NCT00324506 Safety and Efficacy of Cellcept and Avonex as ... True 21180632 False
500 NCT00218491 Effectiveness of N-Acetylcysteine (NAC) in Tre... True 23952889 False
2436 NCT01220310 Doctors and Web-based Self-management Support ... True 23031610 False
1323 NCT00529841 Research Study for Children With Salt Wasting ... True 16817818 False
332 NCT00145197 Reducing Underuse of Early-Stage Breast Cancer... True 19033569 False
42 NCT00027703 Combination Chemotherapy With or Without Bevac... True 22665541 False
1755 NCT00662103 Aerobic Exercise, Resistance Exercise, or Flex... True 20048529 False
1719 NCT00646204 Namenda (Memantine) for Non-Motor Symptoms in ... True 21193343 False
1309 NCT00524966 Randomized Controlled Trial of Bipolar Versus ... True 21072688 False
93 NCT00054418 Risedronate in Preventing Bone Loss in Premeno... True 19075260 False

In [19]:
# Papers only we find results for. If the `results_date` field exists, it 
# means that the results are published on ClinicalTrials.gov. Otherwise 
# we found results on PubMed but they did not - perhaps because
# it's been a couple of years since they did their search.
# We find 43 papers on PubMed that the BMJ authors don't: 
print len(papers_only_we_find_results), 'papers for which only we find results'
print len(papers_only_we_find_results[papers_only_we_find_results.results_date.isnull()]),\
    'of those we find on PubMed, the rest on ClinicalTrials.gov'
cols = ['nct_id', 'title', 'completion_date', 'pubmed_results_bmj', 
        'pmid', 'we_have_results', 'results_date']
# papers_only_we_find_results.sample(20)[cols]
papers_only_we_find_results[papers_only_we_find_results.results_date.isnull()].sample(10)[cols]


409 papers for which only we find results
45 of those we find on PubMed, the rest on ClinicalTrials.gov
Out[19]:
nct_id title completion_date pubmed_results_bmj pmid we_have_results results_date
1475 NCT00580528 Reducing Blood Pressure in Prehypertensive Old... 2010-05-01 False None True NaT
2195 NCT00925262 Controlled Trial of Mental Health Intervention... 2010-02-01 False None True NaT
315 NCT00135668 Dose-Response Study of Sodium Nitroprusside in... 2008-03-01 False None True NaT
2514 NCT01668303 Family-Based Juvenile Drug Court Services 2009-11-01 False None True NaT
1197 NCT00480441 Effectiveness Study of Dronabinol and BRENDA f... 2009-06-01 False None True NaT
306 NCT00133068 Collaboration to Reduce Disparities in Hyperte... 2008-08-01 False None True NaT
1932 NCT00739479 CCRC: Effects of Partially Hydrolyzed Whey Pep... 2010-08-01 False None True NaT
151 NCT00079547 The Safety and Effectiveness of Low and High C... 2008-03-01 False None True NaT
460 NCT00206232 Novel Treatment for Diastolic Heart Failure in... 2010-07-01 False None True NaT
645 NCT00276744 Individualized Drug Treatment Selection Proces... 2010-04-01 False None True NaT

In [20]: